Duplicate content can result from many causes, including licensing
of content to or from your site, site architecture flaws due to
non-SEO-friendly CMSs, or plagiarism. Over the past five years, however,
spammers in desperate need of content began the now much-reviled process
of scraping content from legitimate sources, scrambling the words (through
many complex processes), and repurposing the text to appear on their own
pages in the hopes of attracting long tail searches and serving contextual
ads (and various other nefarious purposes).
Thus, today we’re faced with a world of “duplicate content issues”
and “duplicate content penalties.” Here are some definitions that are
useful for this discussion:
Unique content
This is written by humans, is completely different from any
other combination of letters, symbols, or words on the Web, and is
clearly not manipulated through computer text-processing algorithms
(such as Markov-chain-employing spam tools).
Snippets
These are small chunks of content such as quotes that are
copied and reused; these are almost never problematic for search
engines, especially when included in a larger document with plenty
of unique content.
Shingles
Search engines compare relatively small phrase segments (e.g.,
five to six words long) from a page against segments found on other
pages on the Web. When two documents share too many of these
shingles, the search engines may interpret the documents as
duplicate content (a minimal sketch of this comparison follows
these definitions).
Duplicate content issues
This phrase typically refers to duplicate content that does not
put a website in danger of being penalized, but rather simply
presents the search engines with several copies of a page and
forces them to choose which version to display in the index
(a.k.a. the duplicate content filter).
Duplicate content filter
This is when the search engine removes substantially similar
content from its search results to provide a better overall user
experience.
Duplicate content penalty
Penalties are applied rarely and only in egregious situations.
When one is applied, the engines may devalue or ban not just the
offending pages but other pages on the site as well, or even the
entire website.
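To make the shingle comparison concrete, here is a minimal sketch in
Python; the five-word shingle size and the Jaccard-style overlap score
are illustrative assumptions, not any engine’s actual algorithm:

    def shingles(text, size=5):
        # Break the text into its set of consecutive size-word phrases.
        words = text.lower().split()
        return {" ".join(words[i:i + size])
                for i in range(len(words) - size + 1)}

    def shingle_overlap(text_a, text_b, size=5):
        # Jaccard similarity of the two shingle sets, from 0.0 to 1.0.
        a, b = shingles(text_a, size), shingles(text_b, size)
        if not a and not b:
            return 0.0
        return len(a & b) / len(a | b)

Two documents that share most of their shingles would score close to
1.0 and could be treated as duplicates of one another; unrelated
documents score near 0.0.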
1. Consequences of Duplicate Content
Assuming your duplicate content is the result of innocuous
oversights on your developer’s part, the search engine will most likely
simply filter out all but one of the duplicate pages, because it wants
to display only one version of a particular piece of content in a given
SERP. In some cases, the search engine may filter out duplicates before
ever including them in the index; in other cases, it may allow a page
into the index and filter it out only when assembling the SERPs in
response to a specific query. In the latter case, a page may be
filtered out in response to some queries and not others.
Searchers want diversity in the results, not the same results
repeated again and again. Search engines therefore try to filter out
duplicate copies of content, and this has several consequences:
A search engine bot comes to a site with a crawl budget, an
approximate count of the pages it plans to crawl in each
particular session. Each time it crawls a page that is a duplicate
(which is simply going to be filtered out of search results), you
have let the bot waste some of its crawl budget. That means fewer of
your “good” pages will get crawled, and fewer of your pages may end
up in the search engine’s index.
Links to duplicate content pages represent a waste of link
juice. Duplicate pages can accumulate PageRank, or link juice, but
because the duplicates will not rank, that link juice is misspent.
No search engine has offered a clear explanation of how its
algorithm picks which version of a page to show. In other
words, if it discovers three copies of the same content, which two
does it filter out, and which one does it show? Does the choice vary
based on the search query? The bottom line is that the search engine
might not favor the version you wanted.
Although some SEO professionals may debate some of the preceding
specifics, the general structure will meet with near-universal
agreement. However, there are a couple of problems around the edges of
this model.
For example, on your site you may have a bunch of product pages
and also offer print versions of those pages. The search engine might
pick just the printer-friendly page as the one to show in its results.
This does happen at times, and it can happen even if the
printer-friendly page has lower link juice and will rank less well than
the main product page.
The fix for this is to apply the canonical URL tag to all versions of the page
to indicate which version is the original.
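For example, assuming the product page lives at a hypothetical URL
such as http://www.example.com/products/widget, both it and its
printer-friendly variant would carry the same tag in their <head>
sections:

    <link rel="canonical" href="http://www.example.com/products/widget" />

The engines then treat the tagged URL as the preferred version to
index and display in the results.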
A second version of this can occur when you syndicate content to
third parties. The problem is that the search engine may boot your copy
of the article out of the results in favor of the version in use by the
person republishing your article. The best fix for this, other than
NoIndexing the copy of the article
that your partner is using, is to have the partner implement a link back
to the original source page on your site. Search engines nearly always
interpret this correctly and emphasize your version of the content when
you do that.
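As a minimal sketch, assuming the partner republishes your article at
its own (hypothetical) URL, either remedy lives in the syndicated
copy’s markup:

    <!-- Option 1: keep the syndicated copy out of the engines' indexes -->
    <meta name="robots" content="noindex" />

    <!-- Option 2: a plain link crediting the original source page -->
    <a href="http://www.example.com/original-article">Originally published at Example.com</a>

The first option removes the partner’s copy from competition entirely;
the second leaves it indexed but signals which version is the
original.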